Efficient Search in Hidden Text of Large DjVu Documents

Author

Janusz S. Bien

Abstract

The paper describes an open-source tool that makes the results of advanced language technologies available to end users. It relies on the DjVu format, which for some applications is still superior to other modern formats, including PDF/A. The GPL-licensed DjVu tools are not limited to the DjVuLibre library; they are being supplemented by various new programs, such as pdf2djvu, developed by Jakub Wilk. In particular, pdf2djvu converts to DjVu the PDF output of popular OCR programs such as FineReader, preserving the hidden text layer and some other features. The tool in question was conceived by the present author and consists of a modification of the Poliqarp corpus query tool, used for the National Corpus of Polish; his ideas have been very successfully implemented by Jakub Wilk. The new system, called here simply Poliqarp for DjVu, inherits from its origin not only the powerful search facilities based on two-level regular expressions, but also the ability to represent low-level ambiguities and other linguistic phenomena. Although at present the tool is used mainly to facilitate access to the results of dirty OCR, it is also ready to handle more sophisticated output of linguistic technologies.

1 DjVu technology and DjVuLibre

The DjVu technology, described by its authors as an image compression technique, a document format, and a software platform for delivering documents images over the Internet [Le Cun et al., 2001, p. 2], was originally developed by Yann Le Cun, Léon Bottou, Patrick Haffner, and Paul G. Howard at AT&T Laboratories in 1996. AT&T Laboratories acquired several patents for some aspects of the technology, but did not offer any product using or supporting DjVu¹.
The broad rights to the patents have been purchased by LizardTech

∗ This is an updated version of the paper which appeared in Bernardi, Raffaella and Chambers, Sally and Gottfried, Björn and Segond, Frédérique and Zaihrayeu, Ilya (eds.), Advanced Language Technologies for Digital Libraries, Lecture Notes in Computer Science 6999, Springer Berlin / Heidelberg, pp. 1-14, 2011, DOI 10.1007/978-3-642-23160-5_1 (http://dx.doi.org/10.1007/978-3-642-23160-5_1).

† Formal Linguistics Department, University of Warsaw, Browarna 8/10, 00-927 Warszawa, Poland, [email protected], http://www.klf.uw.edu.pl.

¹ Although the patents in question are valid only in the USA, they definitely delayed the practical applications of the format (fortunately, software patents are not allowed at all in the European Union and many other countries).

Similar resources

Color Documents on the Web with DJVU

We present a new image compression technique called "DjVu" that is specifically geared towards the compression of scanned documents in color at high resolution. With DjVu, a magazine page in color at 300dpi typically occupies between 40KB and 80KB, approximately 5 to 10 times better than JPEG for a similar level of readability. Using a combination of Hidden Markov Model techniques and MDL-driven...

Electronic Document Publishing Using DjVu

Online access to complex compound documents with client-side search and browsing capability is one of the key requirements of effective content management. "DjVu" (Déjà Vu) is a highly efficient document image compression methodology, a file format, and a delivery platform that, when considered together, have been shown to effectively address these issues [1]. Originally developed for scanned color d...

A General Segmentation Scheme for Djvu Document Compression

We describe the "DjVu" (Déjà Vu) technology: an efficient document image compression methodology, a file format, and a delivery platform that together enable instant access to high-quality documents from essentially any platform, over any connection. Originally developed for scanned color documents, it was recently expanded to electronic documents, so DjVu has now truly become a universal docu...

An Improved K-Nearest Neighbor with Crow Search Algorithm for Feature Selection in Text Documents Classification

The Internet provides easy access to many kinds of library resources. However, classifying documents within a large amount of data is still an issue, and finding particular documents demands time and energy. Grouping similar documents into specific classes can reduce the time needed to search for the required data, particularly for text documents. This is further facilitated by using Artificial...
